fix(install): limit preparer concurrency to prevent file handle exhaustion#17633
denyszhak wants to merge 7 commits into astral-sh:main
Conversation
Not sure why the test job is failing; it seems unrelated to the change at first glance.
Yep the failure is unrelated: #17637 |
Interesting, I thought we were already using limits early but it seems we aren't. CC @charliermarsh for the preparer code. |
(force-pushed from 2468a97 to 949397f)
    // Acquire the concurrency permit and advisory lock.
    let _permit = self.acquire_concurrency_permit().await;
    let _lock = lock_shard.lock().await.map_err(Error::CacheLock)?;
I moved the lock later on purpose. If I had just added the semaphore where the old lock was, it would also have started limiting revision and download work on some paths. The tradeoff is that it could potentially duplicate a bit of earlier work, but we still re-check under the lock before building, so final cache correctness stays the same.
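The shape being described (cheap work before the lock, then a re-check under the lock before publishing) is roughly the double-checked pattern below. This is a single-process sketch with an in-process Mutex standing in for uv's cross-process advisory lock; `CACHE`, `get_or_build`, and `expensive_build` are illustrative names, not uv's actual code:

```rust
use std::sync::Mutex;

// In-process stand-in for the on-disk cache shard guarded by an advisory lock.
static CACHE: Mutex<Option<String>> = Mutex::new(None);

fn expensive_build() -> String {
    "built-wheel".to_string() // placeholder for the download + build work
}

fn get_or_build() -> String {
    // Cheap pre-lock check; two workers may both miss and duplicate work.
    if let Some(hit) = CACHE.lock().unwrap().clone() {
        return hit;
    }
    // Expensive work done outside the lock (may be duplicated in a race).
    let candidate = expensive_build();
    // Re-check under the lock before publishing, so only one result wins
    // and the cache stays coherent even if the early work raced.
    let mut slot = CACHE.lock().unwrap();
    if let Some(existing) = slot.as_ref() {
        return existing.clone();
    }
    *slot = Some(candidate.clone());
    candidate
}

fn main() {
    assert_eq!(get_or_build(), "built-wheel");
    // Second call hits the published cache entry instead of rebuilding.
    assert_eq!(get_or_build(), "built-wheel");
}
```

The worst case under a race is wasted duplicate work, not an inconsistent published result, which matches the tradeoff described above.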
url_revision reads and writes cache_shard.entry(HTTP_REVISION). Doesn't that mean two processes can now race on reading and writing this file? I don't think we're completely on the wrong path here; it's just that this code is tricky: we need to ensure that every operation two different processes can perform is either an atomic replace (e.g., building a directory and moving it into place) or protected by a lock, and that there are no TOCTOU problems between processes (such as process A reading from the metadata that something fits, then process B atomically replacing the directory, and then A reading the actual content of the replaced directory, which no longer fits).
There can be a race on HTTP_REVISION, but it's not a harmful one. It's not TOCTOU because we read url_revision once, keep it in memory, and then use revision.id() for the build. We never re-read HTTP_REVISION later, so we don't "check one revision, then use another".
I agree this concern is valid so I’m just trying to avoid it here.
Let me actually do more testing on this specific concern and capture some state to confirm this is harmless, like running two separate uv sync so they will share the same cache.
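The "read once, keep it in memory" argument can be illustrated with a small sketch. The file name and `read_revision` helper are illustrative, not uv's actual cache code:

```rust
use std::fs;
use std::path::Path;

// Read the revision file exactly once; the caller carries the value through
// the rest of the operation, so there is no check-then-use gap to exploit.
fn read_revision(path: &Path) -> std::io::Result<String> {
    fs::read_to_string(path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("http-revision-demo");
    fs::write(&path, "rev-1")?;

    // One read; every later decision (e.g. which revision to build) uses
    // this in-memory value.
    let revision = read_revision(&path)?;

    // Even if another process overwrites the file now...
    fs::write(&path, "rev-2")?;

    // ...this process keeps acting on the revision it already read.
    assert_eq!(revision, "rev-1");
    fs::remove_file(&path)?;
    Ok(())
}
```

The race then degrades to two processes possibly building against different revisions (duplicate work), rather than one process checking one revision and building another.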
I'm thinking specifically about the case of two uv processes, which, without a lock, could read and write url_revision in parallel, if I read the code correctly.
Yes, I think that read is correct. That part was intentional: the goal was to push the semaphore/lock boundary as far down as possible so we don't put the earlier revision/download/extract work behind the semaphore as well.
I agree this is a race, but the behavior I expect from it is duplicated early work (multiple revision shards), not a later "checked one revision, built another".
So I think the local tradeoff is:
1. lock/semaphore later: lets more of the revision/download work run in parallel, but two processes can duplicate some work for the same source
2. lock earlier: avoids the race on the revision, but puts more of the earlier IO behind the new semaphore
Option 2 is the safer choice. My hesitation was mainly the earlier concern about adding more serialization around the IO-heavy part of the work.
On the test side, I added an integration test that runs two separate uv processes against the same direct-URL dependency and shared cache, with a delayed local HTTP response to widen the race window. It verifies both succeed, then runs a third uv lock --offline against the same cache. That last step checks that the cache left behind by the concurrent race is still reusable afterward. In this test, the result looked like duplicate early work rather than an unusable or obviously corrupted cache.
denyszhak@807a93b - that's how it looks
Ahh, my re-check "under the lock" is stupid
There are two constraints we need to watch for. For all the IO operations in the cache and in venvs, we need to ensure that every write is either atomic (create a file or directory, then rename it to the target) or performed while holding a lock. Additionally, we need to make sure that this also holds when the data is stretched across, e.g., a cache info file and a cache data file. What I'm trying to figure out is: can we statically assert from the code that this is the case here, or do we need the lock for the larger scope again?
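The "atomic writes" half of the first constraint is the classic write-to-temp-then-rename pattern: a reader sees either the old file, no file, or the complete new file, never a partial write. A minimal sketch (the temp-file name and paths are illustrative, not uv's actual cache layout, which uses unique temp names):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Publish `data` at `target` atomically via a temp file plus rename.
fn atomic_write(target: &Path, data: &[u8]) -> std::io::Result<()> {
    // The temp file must live on the same filesystem as `target`, or the
    // rename is not atomic; putting it in the same directory guarantees that.
    let dir = target.parent().unwrap_or_else(|| Path::new("."));
    let tmp = dir.join(".tmp-atomic-write"); // illustrative fixed name
    let mut f = fs::File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?; // flush to disk before the rename publishes the file
    fs::rename(&tmp, target)?; // atomic replace on POSIX filesystems
    Ok(())
}

fn main() -> std::io::Result<()> {
    let target = std::env::temp_dir().join("revision-demo.json");
    atomic_write(&target, b"{\"revision\": \"abc123\"}")?;
    assert_eq!(fs::read(&target)?, b"{\"revision\": \"abc123\"}");
    fs::remove_file(&target)?;
    Ok(())
}
```

The second constraint (data stretched across an info file and a data file) is exactly what a single rename cannot cover, which is why a lock or a combined atomically-renamed directory is needed there.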
I think the first constraint is mostly satisfied here, but not the second. The second looks possible to satisfy too, but it would take a bit more restructuring. Let me get this version implemented first, and then we can decide whether we want to push it further.
For archives, I changed the pre-lock revision to be temporary only. Under the lock, we now pick or write the canonical LOCAL_REVISION and use that canonical revision from there on, so the archive state stays coherent.
For URLs, I only changed the post-lock behaviour: under the lock we now re-read HTTP_REVISION and use that canonical revision for the rest of the path. I did not move the HTTP_REVISION write itself under the lock, because that write currently happens inside the cached client, and doing it cleanly would require a deferred-write refactor, which seems too broad.
If this ends up being the direction we want, I can follow up with a cleanup of the repeated "re-scope to canonical revision and re-check under lock" code, but I wanted to keep this change focused on the coordination fix itself.
(force-pushed from 1308c32 to 1484433)
Summary
Resolves #17512
Fixes an issue where uv could exhaust the operating system's file descriptor limit when installing projects with a large number of local dependencies, causing a "Too many open files (os error 24)" crash.
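The core idea of the fix is to bound how many prepare tasks hold file handles at once. A minimal sketch using a hand-rolled counting semaphore from the standard library (the real code uses uv's async permit machinery; `Semaphore`, `MAX_IN_FLIGHT`, and the worker closure here are illustrative):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Tiny counting semaphore built from Mutex + Condvar.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // block until a permit is free
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    const MAX_IN_FLIGHT: usize = 4; // bound on simultaneously held handles
    let sem = Arc::new(Semaphore::new(MAX_IN_FLIGHT));
    let peak = Arc::new(Mutex::new((0usize, 0usize))); // (current, max seen)

    let handles: Vec<_> = (0..32)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut p = peak.lock().unwrap();
                    p.0 += 1;
                    p.1 = p.1.max(p.0);
                }
                // ...open files / do the prepare work here...
                thread::sleep(Duration::from_millis(5));
                peak.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    // No matter how many tasks are queued, at most MAX_IN_FLIGHT ran at once.
    let max_seen = peak.lock().unwrap().1;
    assert!(max_seen <= MAX_IN_FLIGHT);
}
```

With the bound in place, total open file handles scale with the permit count rather than with the number of dependencies, which is what keeps the process under `ulimit -n`.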
Test Plan
    ulimit -n 256   # set the allowed file descriptor count to a low value
    mkdir -p /tmp/airflow-test
    git clone --depth 1 https://github.com/apache/airflow /tmp/airflow-test
    cd /tmp/airflow-test
    /Users/denyszhak/personal-repos/uv/target/debug/uv sync
Bug: hit
    Too many open files (os error 24) at path "/Users/denyszhak/.cache/uv/sdists-v9/editable/136c133a3434a0fa/.tmpZtqKQt"
Fix: running the same uv sync now succeeds:
    Resolved 922 packages in 10.82s